Abstract. Length-$q$ substrings, or $q$-grams, can represent important characteristics of text data, and determining the frequencies of all $q$-grams contained in the data is an important problem with many applications in the field of data mining and machine learning. In this paper, we consider the problem of calculating the non-overlapping frequencies of all $q$-grams in a text given in compressed form, namely, as a straight line program (SLP). We show that the problem can be solved in $O(q^2 n)$ time and $O(qn)$ space, where $n$ is the size of the SLP. This generalizes and greatly improves previous work (Inenaga & Bannai, 2009), which solved the problem only for $q = 2$ in $O(n^4 \log n)$ time and $O(n^3)$ space.



1 Introduction 



In many situations, large-scale text data is first compressed for storage and then decompressed when it is processed afterwards, at which point we must again face the size of the data. To circumvent this problem, algorithms that work directly on the compressed representation without explicit decompression have gained attention, especially for the string pattern matching problem [1], and there has been growing interest in what problems can be efficiently solved in this kind of setting [14, 17, 7, 16, 8, 6, 4].

The non-overlapping occurrence frequency of a string $P$ in a text string $T$ is defined as the maximum number of non-overlapping occurrences of $P$ in $T$ [3]. Non-overlapping frequencies are required in several grammar-based compression algorithms [13, 2], as well as ... In this paper, we consider the problem of computing the non-overlapping occurrence frequencies of all $q$-grams (length-$q$ substrings) occurring in a text $T$, when the text is given as a straight line program (SLP) [10] of size $n$. An SLP is a context-free grammar in Chomsky normal form that derives a single string. SLPs are a widely accepted abstract model of various text compression schemes, since texts compressed by any grammar-based compression algorithm (e.g. [18, 13]) can be represented as SLPs, and those compressed by the LZ-family (e.g. [19, 20]) can be quickly transformed to SLPs. Theoretically, the length $N$ of the text represented by an SLP of size $n$ can be as large as $O(2^n)$, and therefore a polynomial time algorithm that runs on an SLP representation is, in the worst case, faster than any algorithm which works on the uncompressed string.



For SLP-compressed texts, the problem was first considered in [8], where an algorithm for $q = 2$ running in $O(n^4 \log n)$ time and $O(n^3)$ space was presented. However, the algorithm cannot be readily extended to handle $q > 2$. Intuitively, the problem for $q = 2$ is much easier than for larger values of $q$, since there is only one way for a 2-gram to overlap, while there can be many ways that a longer $q$-gram can overlap. In this paper we present the first algorithm for calculating the non-overlapping occurrence frequency of all $q$-grams that works for any $q \ge 2$, and runs in $O(q^2 n)$ time and $O(qn)$ space. Not only do we solve a more general problem, but the complexity is greatly improved compared to previous work.

A similar problem for SLPs, where occurrences of $q$-grams are allowed to overlap, was also considered in [8], where an $O(|\Sigma|^2 n^2)$ time and $O(n^2)$ space algorithm was presented for $q = 2$. A much simpler and more efficient $O(qn)$ time and space algorithm for general $q \ge 2$ was recently developed [6]. As is the case with uncompressed strings, ideas from the algorithms allowing overlapping occurrences can be applied to some extent to the problem of obtaining non-overlapping occurrence frequencies. However, there are still difficulties arising from the overlapping of occurrences that must be overcome: the occurrences of each $q$-gram can be obtained in the same way, but we must somehow compute their non-overlapping occurrence frequency, which is not a trivial task.

For uncompressed texts, the problem considered in this paper can be solved in $O(|T|)$ time, by applying string indices such as suffix arrays. A similar problem is the string statistics problem [3], which asks for the non-overlapping occurrence frequency of a given string $P$ in text string $T$. The problem can be solved in $O(|P|)$ time for any $P$, provided that the text is pre-processed in $O(|T| \log |T|)$ time using the sophisticated algorithm of [5]. However, note that the preprocessing requires only $O(|T|)$ time if occurrences are allowed to overlap. This perhaps indicates the intrinsic difficulty that arises when considering overlaps.

2 Preliminaries 

2.1 Notation 

Let $\Sigma$ be a finite alphabet. An element of $\Sigma^*$ is called a string. The length of a string $T$ is denoted by $|T|$. The empty string $\varepsilon$ is a string of length 0, namely, $|\varepsilon| = 0$. A string of length $q > 0$ is called a $q$-gram. The set of $q$-grams is denoted by $\Sigma^q$. For a string $T = XYZ$, $X$, $Y$ and $Z$ are called a prefix, substring, and suffix of $T$, respectively. The $i$-th character of a string $T$ is denoted by $T[i]$ for $1 \le i \le |T|$, and the substring of $T$ that begins at position $i$ and ends at position $j$ is denoted by $T[i : j]$ for $1 \le i \le j \le |T|$. For convenience, let $T[i : j] = \varepsilon$ if $j < i$. Let $T^R$ denote the reversal of $T$, namely, $T^R = T[N] T[N-1] \cdots T[1]$, where $N = |T|$.

For an integer $i$ and a set of integers $A$, let $i \oplus A = \{i + x \mid x \in A\}$ and $i \ominus A = \{i - x \mid x \in A\}$. If $A = \emptyset$, then let $i \oplus A = i \ominus A = \emptyset$. Similarly, for a pair of integers $(x, y)$, let $i \oplus (x, y) = (i + x, i + y)$.



2.2 Occurrences and Frequencies 

For any strings $T$ and $P$, let $Occ(T, P)$ be the set of occurrences of $P$ in $T$, i.e.,

$$Occ(T, P) = \{k > 0 \mid T[k : k + |P| - 1] = P\}.$$

The number of occurrences of $P$ in $T$, or the frequency of $P$ in $T$, is $|Occ(T, P)|$. Any two occurrences $k_1, k_2 \in Occ(T, P)$ with $k_1 < k_2$ are said to be overlapping if $k_1 + |P| - 1 \ge k_2$. Otherwise, they are said to be non-overlapping. The non-overlapping frequency $nOcc(T, P)$ of $P$ in $T$ is defined as the size of a largest subset of $Occ(T, P)$ where any two occurrences in the set are non-overlapping. For any strings $X$, $Y$, we say that an occurrence $z$ of a string $Z$ in $XY$, with $|Z| \ge 2$, crosses $X$ and $Y$ if $z \in [|X| - |Z| + 2 : |X|] \cap Occ(XY, Z)$.

For any strings $T$ and $P$, we define the sets of right and left priority non-overlapping occurrences of $P$ in $T$, respectively, as follows:

$$RnOcc(T, P) = \begin{cases} \emptyset & \text{if } Occ(T, P) = \emptyset, \\ \{i\} \cup RnOcc(T[1 : i - 1], P) & \text{otherwise,} \end{cases}$$

$$LnOcc(T, P) = \begin{cases} \emptyset & \text{if } Occ(T, P) = \emptyset, \\ \{j\} \cup \big((j + |P| - 1) \oplus LnOcc(T[j + |P| : |T|], P)\big) & \text{otherwise,} \end{cases}$$

where $i = \max Occ(T, P)$ and $j = \min Occ(T, P)$. For all $k \in RnOcc(T, P)$, it trivially holds that $RnOcc(T[k : |T|], P) \subseteq RnOcc(T, P)$, once positions in $T[k : |T|]$ are shifted by $k - 1$. A symmetric statement holds for $LnOcc$. Note that $RnOcc(T, P) \subseteq Occ(T, P)$, $LnOcc(T, P) \subseteq Occ(T, P)$, and $LnOcc(T, P) = (|T| - |P| + 2) \ominus RnOcc(T^R, P^R)$.

Lemma 1. $nOcc(T, P) = |RnOcc(T, P)| = |LnOcc(T, P)|$.

Proof. See Appendix.

Lemma 2. For any strings $T$ and $P$, and any integer $i$ with $1 \le i \le |T|$, let $u_1 = \max LnOcc(T[1 : i - 1], P) + |P| - 1$ and $u_2 = i - 1 + \min RnOcc(T[i : |T|], P)$. Then $nOcc(T, P) = |LnOcc(T[1 : u_1], P)| + nOcc(T[u_1 + 1 : u_2 - 1], P) + |RnOcc(T[u_2 : |T|], P)|$.

Proof. By Lemma 1 and the definitions of $u_1$, $u_2$, $LnOcc$ and $RnOcc$, we have

$$\begin{aligned}
nOcc(T, P) &= |LnOcc(T[1 : u_1], P)| + |LnOcc(T[u_1 + 1 : |T|], P)| \\
&= |LnOcc(T[1 : u_1], P)| + |RnOcc(T[u_1 + 1 : |T|], P)| \\
&= |LnOcc(T[1 : u_1], P)| + |RnOcc(T[u_1 + 1 : u_2 - 1], P)| + |RnOcc(T[u_2 : |T|], P)| \\
&= |LnOcc(T[1 : u_1], P)| + nOcc(T[u_1 + 1 : u_2 - 1], P) + |RnOcc(T[u_2 : |T|], P)|.
\end{aligned}$$

□
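The definitions of $Occ$, $LnOcc$ and $RnOcc$, and the identities of Lemmas 1 and 2, can be checked on small uncompressed strings with a direct sketch such as the following. The function names (`occ`, `ln_occ`, `rn_occ`, `lemma2_split`) are illustrative and not from the paper; positions are 1-based as in the text, the pattern is assumed non-empty, and `lemma2_split` assumes that both $u_1$ and $u_2$ exist.

```python
def occ(t: str, p: str) -> list[int]:
    """All 1-based starting positions of p in t (overlaps allowed)."""
    m = len(p)
    return [k + 1 for k in range(len(t) - m + 1) if t[k:k + m] == p]


def ln_occ(t: str, p: str) -> list[int]:
    """Left-priority non-overlapping occurrences: greedy scan from the left."""
    chosen, next_free = [], 1
    for k in occ(t, p):
        if k >= next_free:          # does not overlap the last chosen occurrence
            chosen.append(k)
            next_free = k + len(p)
    return chosen


def rn_occ(t: str, p: str) -> list[int]:
    """Right-priority non-overlapping occurrences: greedy scan from the right."""
    chosen, limit = [], len(t) + 1
    for k in reversed(occ(t, p)):
        if k + len(p) - 1 < limit:  # ends before the last chosen occurrence starts
            chosen.append(k)
            limit = k
    return sorted(chosen)


def lemma2_split(t: str, p: str, i: int) -> int:
    """Right-hand side of Lemma 2 for split position i (u1 and u2 must exist)."""
    u1 = max(ln_occ(t[:i - 1], p)) + len(p) - 1
    u2 = i - 1 + min(rn_occ(t[i - 1:], p))
    return (len(ln_occ(t[:u1], p))           # T[1 : u1]
            + len(ln_occ(t[u1:u2 - 1], p))   # nOcc of T[u1+1 : u2-1], via Lemma 1
            + len(rn_occ(t[u2 - 1:], p)))    # T[u2 : |T|]


T = "aaabaabaaabaabaaaabaa"
print(occ(T, "aabaa"), ln_occ(T, "aabaa"), rn_occ(T, "aabaa"))
print(lemma2_split(T, "aabaa", 10))
```

The two greedy scans pick different occurrences in general but always the same number of them, which is exactly what Lemma 1 states.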

We will later make use of the solution to the following problem, where occurrences of $q$-grams are weighted and allowed to overlap.

Problem 1 (weighted overlapping $q$-gram frequencies). Given a string $T$, an integer $q$, and an integer array $w$ ($|w| = |T|$), compute $\sum_{i \in Occ(T, P)} w[i]$ for all $q$-grams $P \in \Sigma^q$ where $Occ(T, P) \ne \emptyset$.

Theorem 1 ([6]). Problem 1 can be solved in $O(|T|)$ time.

Proof. See Appendix.
2.3 Straight Line Programs

In this paper, we treat strings described in terms of straight line programs (SLPs). A straight line program $\mathcal{T}$ is a sequence of assignments $\{X_1 = expr_1, X_2 = expr_2, \ldots, X_n = expr_n\}$. Each $X_i$ is a variable and each $expr_i$ is an expression where $expr_i = a$ ($a \in \Sigma$), or $expr_i = X_\ell X_r$ ($\ell, r < i$). We will sometimes abuse notation and denote $\mathcal{T}$ as $\{X_i\}_{i=1}^{n}$. Denote by $T$ the string derived from the last variable $X_n$ of the program $\mathcal{T}$. Fig. 1 shows an example of an SLP. The size of the program $\mathcal{T}$ is the number $n$ of assignments in $\mathcal{T}$.

Let $val(X_i)$ represent the string derived from $X_i$. When it is not confusing, we identify a variable $X_i$ with $val(X_i)$. Then, $|X_i|$ denotes the length of the string $X_i$ derives, and $X_i[j] = val(X_i)[j]$, $X_i[j : k] = val(X_i)[j : k]$ for $1 \le j, k \le |X_i|$. Let $vOcc(X_i)$ denote the number of times a variable $X_i$ occurs in the derivation of $T$. For example, $vOcc(X_4) = 3$ in Fig. 1.

Both $|X_i|$ and $vOcc(X_i)$ can be computed for all $1 \le i \le n$ in a total of $O(n)$ time by a simple iteration on the variables: $|X_i| = 1$ for any $X_i = a$ ($a \in \Sigma$), and $|X_i| = |X_\ell| + |X_r|$ for any $X_i = X_\ell X_r$. Also, $vOcc(X_n) = 1$ and for $i < n$, $vOcc(X_i) = \sum\{vOcc(X_k) \mid X_k = X_i X_r\} + \sum\{vOcc(X_k) \mid X_k = X_\ell X_i\}$.

We shall assume, as in various previous work on SLPs, that the word size is at least $\log |T|$, and hence values representing lengths and positions of $T$ in our algorithms can be manipulated in constant time.
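The $O(n)$ iteration for $|X_i|$ and $vOcc(X_i)$ can be sketched as follows on the SLP of Fig. 1. The dictionary encoding of the SLP (a character for a terminal rule, a pair of earlier rule indices for a binary rule) and the function name are assumptions made for this illustration only.

```python
# SLP of Fig. 1: rule i is a character or a pair (l, r) of earlier rule indices.
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (3, 4), 6: (4, 5), 7: (6, 5)}


def lengths_and_vocc(slp):
    """Return ({i: |X_i|}, {i: vOcc(X_i)}) using one pass in each direction."""
    n = max(slp)
    length = {}
    for i in range(1, n + 1):                 # bottom-up: children come first
        rule = slp[i]
        length[i] = 1 if isinstance(rule, str) else length[rule[0]] + length[rule[1]]
    vocc = {i: 0 for i in slp}
    vocc[n] = 1
    for i in range(n, 0, -1):                 # top-down: parents come first
        rule = slp[i]
        if not isinstance(rule, str):
            vocc[rule[0]] += vocc[i]          # X_i contributes to its left child
            vocc[rule[1]] += vocc[i]          # ... and to its right child
    return length, vocc


length, vocc = lengths_and_vocc(SLP)
print(length[7], vocc[4])
```

Processing the variables in decreasing index order guarantees that $vOcc(X_i)$ is final before it is pushed down to the children, since every parent of $X_i$ has a larger index.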
Fig. 1. The derivation tree of SLP $\mathcal{T} = \{X_1 = a, X_2 = b, X_3 = X_1 X_2, X_4 = X_1 X_3, X_5 = X_3 X_4, X_6 = X_4 X_5, X_7 = X_6 X_5\}$, which represents string $T = val(X_7) = aababaababaab$.



3 q-gram Non-Overlapping Frequencies on Compressed String

The goal of this paper is to efficiently solve the following problem.

Problem 2 (Non-overlapping $q$-gram frequencies on SLP). Given an SLP $\mathcal{T}$ of size $n$ that describes string $T$ and a positive integer $q$, compute $nOcc(T, P)$ for all $q$-grams $P \in \Sigma^q$.

If we decompress the given SLP $\mathcal{T}$, obtaining the string $T$, then we can solve the problem in $O(|T|)$ time. However, it holds that $|T| = O(2^n)$. Hence, in order to solve the problem efficiently, we have to establish an algorithm that does not explicitly decompress the given SLP $\mathcal{T}$.
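For reference, the decompress-then-count baseline mentioned above can be sketched as follows. This is a simple dictionary-based version running in expected $O(q|T|)$ time, not the $O(|T|)$ suffix-array method, and it is only useful for validating a compressed-domain implementation on small inputs. The SLP encoding and function names are assumptions of this sketch; the greedy left-to-right count is justified by Lemma 1.

```python
from collections import defaultdict

# SLP of Fig. 1: rule i is a character or a pair (l, r) of earlier rule indices.
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (3, 4), 6: (4, 5), 7: (6, 5)}


def expand(slp) -> str:
    """Explicitly decompress the SLP (exponential in the worst case)."""
    val = {}
    for i in sorted(slp):
        rule = slp[i]
        val[i] = rule if isinstance(rule, str) else val[rule[0]] + val[rule[1]]
    return val[max(slp)]


def nonoverlapping_qgram_freq(t: str, q: int) -> dict[str, int]:
    """nOcc(t, P) for every q-gram P occurring in t, by left-priority greedy."""
    count = defaultdict(int)
    next_free = {}                       # q-gram -> first 0-based index not yet covered
    for k in range(len(t) - q + 1):
        g = t[k:k + q]
        if next_free.get(g, 0) <= k:     # no overlap with the last counted occurrence
            count[g] += 1
            next_free[g] = k + q
    return dict(count)


text = expand(SLP)
print(text, nonoverlapping_qgram_freq(text, 2))
```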

3.1 Key Ideas 

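For any variable $X_i$ and integer $k \ge 1$, let $pre(X_i, k) = X_i[1 : \min\{k, |X_i|\}]$ and $suf(X_i, k) = X_i[|X_i| - \min\{k, |X_i|\} + 1 : |X_i|]$. That is, $pre(X_i, k)$ and $suf(X_i, k)$ are the prefix and the suffix of $val(X_i)$ of length $k$ (or $val(X_i)$ itself when it is shorter than $k$), respectively. For all variables $X_i$, $pre(X_i, k)$ can be computed in a total of $O(nk)$ time and space, as follows:

$$pre(X_i, k) = \begin{cases} val(X_i) & \text{if } |X_i| \le k, \\ pre(X_\ell, k)\, pre(X_r, k - |X_\ell|) & \text{if } X_i = X_\ell X_r \text{ and } |X_\ell| < k < |X_i|, \\ pre(X_\ell, k) & \text{if } X_i = X_\ell X_r \text{ and } k \le |X_\ell|. \end{cases}$$

$suf(X_i, k)$ can be computed similarly in $O(nk)$ time and space.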
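The recurrence for $pre$ and its mirror image for $suf$ can be sketched as a single bottom-up pass. The SLP encoding and the function name are assumptions of this sketch; it requires $k \ge 1$, and truncating the concatenation of the children's prefixes (or suffixes) covers all three cases of the recurrence at once.

```python
# SLP of Fig. 1: rule i is a character or a pair (l, r) of earlier rule indices.
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (3, 4), 6: (4, 5), 7: (6, 5)}


def prefixes_and_suffixes(slp, k: int):
    """Return ({i: pre(X_i, k)}, {i: suf(X_i, k)}) in O(nk) time and space (k >= 1)."""
    pre, suf = {}, {}
    for i in sorted(slp):                      # children have smaller indices
        rule = slp[i]
        if isinstance(rule, str):
            pre[i] = suf[i] = rule
        else:
            l, r = rule
            pre[i] = (pre[l] + pre[r])[:k]     # at most 2k characters are touched
            suf[i] = (suf[l] + suf[r])[-k:]
    return pre, suf


pre, suf = prefixes_and_suffixes(SLP, 4)
print(pre[7], suf[7])
```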

For any string $T$ and positive integers $q$ and $j$ ($1 \le j < j + q - 1 \le |T|$), the longest overlapping cover of the $q$-gram $P = T[j : j + q - 1]$ w.r.t. position $j$ of $T$ is an ordered pair $loc_q(T, j) = (b, e)$ of positions in $T$ which is defined as:

$$loc_q(T, j) = \operatorname*{arg\,max}_{(b, e)} \left\{ e - b \;\middle|\; \begin{array}{l} (b, e) \in Occ(T, P) \times ((q - 1) \oplus Occ(T, P)), \\ b \le j < j + q - 1 \le e, \\ \forall k \in [b : e - q] \cap Occ(T, P), \\ \quad [k + 1 : \min\{k + q - 1, e - q + 1\}] \cap Occ(T, P) \ne \emptyset \end{array} \right\}.$$

Namely, $loc_q(T, j)$ represents the beginning and ending positions of the maximum chain of overlapping occurrences of the $q$-gram $T[j : j + q - 1]$ that contains position $j$. For example, consider string $T = aaabaabaaabaabaaaabaa$ of length 21. For $q = 5$ and $j = 9$, we have $loc_q(T, j) = (2, 16)$, since $T[2 : 6] = T[5 : 9] = T[9 : 13] = T[12 : 16] = aabaa$. Note that $T[17 : 21] = aabaa$ is not contained in this chain since it does not overlap with $T[12 : 16]$.

Lemma 3. Given a string $T$ and integers $q$, $j$, the longest overlapping cover $loc_q(T, j)$ can be computed in $O(|T|)$ time.

Proof. Using, for example, the KMP algorithm [12], we can obtain a sorted list of $Occ(T, T[j : j + q - 1])$ in $O(|T|)$ time. We can then scan this list forwards and backwards from $j$ to obtain $b$ and $e$. □
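The scan described in the proof of Lemma 3 can be sketched as follows. For brevity this sketch finds the occurrences by naive matching rather than with the KMP algorithm, so it does not achieve the $O(|T|)$ bound; the function name `loc` and the 1-based interface are assumptions for illustration.

```python
def loc(t: str, q: int, j: int) -> tuple[int, int]:
    """Longest overlapping cover (b, e) of the q-gram at 1-based position j of t."""
    p = t[j - 1:j - 1 + q]
    # Sorted 1-based occurrences of p (naive matching instead of KMP).
    occs = [k + 1 for k in range(len(t) - q + 1) if t[k:k + q] == p]
    idx = occs.index(j)
    lo = idx
    while lo > 0 and occs[lo - 1] + q - 1 >= occs[lo]:              # extend to the left
        lo -= 1
    hi = idx
    while hi + 1 < len(occs) and occs[hi] + q - 1 >= occs[hi + 1]:  # extend to the right
        hi += 1
    return occs[lo], occs[hi] + q - 1


T = "aaabaabaaabaabaaaabaa"
print(loc(T, 5, 9))
```

On the running example the chain stops at position 16 because the occurrence at 17 starts after the occurrence at 12 has ended.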

For a variable $X_i = X_\ell X_r$ and a position $1 \le j \le |X_i| - q + 1$, a longest overlapping cover $(b, e) = loc_q(X_i, j)$ is said to be closed in $X_i$ if $q - 1 < b$ and $e < |X_i| - q + 2$.

Theorem 2. Problem 2 can be solved in $O(q^2 n)$ time, provided that, for all variables $X_i = X_\ell X_r$ and $j$ s.t. $|X_i| \ge q$ and $\max\{1, |X_\ell| - 2(q - 1) + 1\} \le j \le \min\{|X_\ell| + q - 1, |X_i| - q + 1\}$, $(b, e) = loc_q(X_i, j)$ and $nOcc(X_i[b : e], s)$ are already computed, where $s = X_i[j : j + q - 1]$.

Proof. Algorithm 1 shows pseudo-code of our algorithm to solve Problem 2.

Consider the $q$-gram $s = X_i[j : j + q - 1]$ at position $j$ for which $(b, e) = loc_q(X_i, j)$ is closed in $X_i$. A key observation is that, if $(b, e)$ is closed in $X_i$, then $(b, e)$ is never closed in $X_\ell$ or $X_r$. Therefore, by summing up $vOcc(X_i) \cdot nOcc(X_i[b : e], s)$ for each closed $(b, e)$ in $X_i$, for all such variables $X_i$, we obtain $nOcc(T, s)$. Line 14 is sufficient to check if $(b, e)$ is closed.

For all $1 \le i \le n$, $vOcc(X_i)$ can be computed in $O(n)$ time, and $t_i = suf(X_\ell, 2(q - 1))\, pre(X_r, 2(q - 1))$ can be computed in $O(qn)$ time and space. The problem amounts to summing up the values of $vOcc(X_i) \cdot nOcc(X_i[b : e], s)$ for each $q$-gram $s$ contained in each $t_i$, and can be reduced to Problem 1 on string $z$ and integer array $w$ of length $O(qn)$, which can be solved in $O(qn)$ time by Theorem 1.

In line 15, we check that there is no previous position $h$ ($\max\{1, |X_\ell| - 2(q - 1) + 1\} \le h < j$) such that $X_i[h : h + q - 1] = X_i[j : j + q - 1]$ and $loc_q(X_i, h) = loc_q(X_i, j)$, so that we do not count the same $q$-gram more than once. If there is no such $h$, we set the value of $w_i[k - |X_\ell| + j]$ to $vOcc(X_i) \cdot nOcc(X_i[b : e], s)$. This can be checked in $O(q^2 n)$ time for all $X_i$ and $j$.

For convenience, we assume that $T = val(X_n)$ starts and ends with special characters $\#^{q-1}$ and $\$^{q-1}$, respectively, that do not occur anywhere else in $T$. Then we can cope with the last variable $X_n$ as described above. Hence the theorem holds. □



3.2 Computing Longest Overlapping Covers 

In this subsection, we will show how to compute the longest overlapping cover $(b, e) = loc_q(X_i, j)$, where $s = X_i[j : j + q - 1]$, for all $X_i$ and all $j$ required for Theorem 2. For any string $T$ and integers $q$ and $j$ ($1 \le j \le q$), let

$$\overrightarrow{loc}_q(T, j) = \begin{cases} (j, be) & \text{if } j + q - 1 \le |T|, \\ (j, |T|) & \text{otherwise,} \end{cases}$$

$$\overleftarrow{loc}_q(T, j) = \begin{cases} (eb, |T| - j + 1) & \text{if } |T| - j - q + 2 \ge 1, \\ (1, |T| - j + 1) & \text{otherwise,} \end{cases}$$

where $(j, be) = (j - 1) \oplus loc_q(T[j : |T|], 1)$ and $(eb, |T| - j + 1) = loc_q(T[1 : |T| - j + 1], |T| - j - q + 2)$. Namely, $\overrightarrow{loc}_q(T, j)$ is a suffix of the longest overlapping cover of the $q$-gram $T[j : j + q - 1]$ that begins at position $j$ ($1 \le j \le q$) in $T$, and $\overleftarrow{loc}_q(T, j)$ is a prefix of the longest overlapping cover of the $q$-gram $T[|T| - j - q + 2 : |T| - j + 1]$ that ends at position $|T| - j + 1$ in $T$.
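The two one-sided covers defined above, one anchored at a starting position and extending to the right, the other anchored at an ending position and extending to the left, can be sketched directly from the definitions on an uncompressed string. The names `forward_loc` and `backward_loc` are hypothetical labels for these two covers, and `loc` is a naive (non-KMP) helper written for this illustration.

```python
def loc(t: str, q: int, j: int) -> tuple[int, int]:
    """Longest overlapping cover of the q-gram at 1-based position j (naive scan)."""
    p = t[j - 1:j - 1 + q]
    occs = [k + 1 for k in range(len(t) - q + 1) if t[k:k + q] == p]
    lo = hi = occs.index(j)
    while lo > 0 and occs[lo - 1] + q - 1 >= occs[lo]:
        lo -= 1
    while hi + 1 < len(occs) and occs[hi] + q - 1 >= occs[hi + 1]:
        hi += 1
    return occs[lo], occs[hi] + q - 1


def forward_loc(t: str, q: int, j: int) -> tuple[int, int]:
    """Cover that begins exactly at position j and extends to the right."""
    if j + q - 1 <= len(t):
        _, e = loc(t[j - 1:], q, 1)        # cover inside T[j : |T|], starting at 1
        return j, j - 1 + e                # shift back by j - 1
    return j, len(t)


def backward_loc(t: str, q: int, j: int) -> tuple[int, int]:
    """Cover that ends exactly at position |T| - j + 1 and extends to the left."""
    n = len(t)
    if n - j - q + 2 >= 1:
        b, _ = loc(t[:n - j + 1], q, n - j - q + 2)
        return b, n - j + 1
    return 1, n - j + 1


T = "aaabaabaaabaabaaaabaa"
print(forward_loc(T, 5, 5), backward_loc(T, 5, 6))
```

In the example, the forward cover from position 5 drops the occurrence at 2, while the backward cover ending at position 16 recovers the full chain.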



Algorithm 1: Computing q-gram non-overlapping frequencies from SLP

Input: SLP $\mathcal{T} = \{X_i\}_{i=1}^{n}$ representing string $T$, integer $q \ge 2$.
Output: $nOcc(T, P)$ for all $q$-grams $P \in \Sigma^q$ where $Occ(T, P) \ne \emptyset$.

 1  Compute $vOcc(X_i)$ for all $1 \le i \le n$;
 2  Compute $pre(X_i, 2(q - 1))$ and $suf(X_i, 2(q - 1))$ for all $1 \le i \le n - 1$;
 3  $z \leftarrow \varepsilon$; $w \leftarrow [\,]$;
 4  for $i \leftarrow 1$ to $n$ do
 5      if $|X_i| \ge q$ then
 6          let $X_i = X_\ell X_r$;
 7          $k \leftarrow |suf(X_\ell, 2(q - 1))|$;
 8          $t_i = suf(X_\ell, 2(q - 1))\, pre(X_r, 2(q - 1))$;
 9          $z$.append($t_i$);
10          $w_i \leftarrow$ create integer array of length $|t_i|$, each element set to 0;
11          for $j \leftarrow \max\{1, |X_\ell| - 2(q - 1) + 1\}$ to $\min\{|X_\ell| + q - 1, |X_i| - q + 1\}$ do
12              $s \leftarrow X_i[j : j + q - 1]$;
13              $(b, e) \leftarrow loc_q(X_i, j)$;
14              if $q - 1 < b$ and $e < |X_i| - q + 2$ then
15                  if $loc_q(X_i, h) \ne loc_q(X_i, j)$ for any position $h$ s.t. $\max\{1, |X_\ell| - 2(q - 1) + 1\} \le h < j$ then
16                      $w_i[k - |X_\ell| + j] \leftarrow vOcc(X_i) \cdot nOcc(X_i[b : e], s)$;
17          $w$.append($w_i$);
18  Calculate $q$-gram frequencies in $z$, where each $q$-gram starting at position $d$ is weighted by $w[d]$.



Lemma 4. For all $1 \le i \le n$ and $1 \le j \le 2(q - 1)$, $\overrightarrow{loc}_q(X_i, j)$ can be computed in a total of $O(q^2 n)$ time.

Proof. We use dynamic programming. Let $X_i = X_\ell X_r$, $p_j = X_i[j : j + q - 1]$, and assume $\overrightarrow{loc}_q(X_\ell, j)$ and $\overrightarrow{loc}_q(X_r, j)$ have been calculated for all $1 \le j \le 2(q - 1)$. We examine the string $X_i[\max\{j, |X_\ell| - q + 2\} : \min\{|X_i|, |X_\ell| + q - 1\}]$ for occurrences of $p_j$ that cross $X_\ell$ and $X_r$, obtain its longest overlapping cover $(b_i, e_i)$, and check if it overlaps with $\overrightarrow{loc}_q(X_\ell, j)$. Furthermore, let $bb_r$ be the leftmost occurrence of $p_j$ in $X_r$ that has the possibility of overlapping with $(b_i, e_i)$. Then, $\overrightarrow{loc}_q(X_i, j)$ is either $\overrightarrow{loc}_q(X_\ell, j)$, or its end can be extended to $e_i$, or further to the end of $\overrightarrow{loc}_q(X_r, bb_r)$, depending on how the covers overlap.

More precisely, let $(j, be_\ell) = \overrightarrow{loc}_q(X_\ell, j)$, $(b_i, e_i) = \max\{j - 1, |X_\ell| - q + 1\} \oplus loc_q(X_i[\max\{j, |X_\ell| - q + 2\} : \min\{|X_i|, |X_\ell| + q - 1\}], h)$ where $h \in Occ(X_i[\max\{j, |X_\ell| - q + 2\} : \min\{|X_i|, |X_\ell| + q - 1\}], p_j)$, and $(bb_r, be_r) = |X_\ell| \oplus \overrightarrow{loc}_q(X_r, k)$ where $k = \min Occ(pre(X_r, 2(q - 1)), p_j)$. (Note that $(bb_r, be_r)$ and $(b_i, e_i)$ are not defined if occurrences $h$, $k$ of $p_j$ do not exist.) Then we have

$$\overrightarrow{loc}_q(X_i, j) = \begin{cases} (j, be_\ell) & \text{if } be_\ell < b_i \text{ or } \nexists h, \\ (j, e_i) & \text{if } b_i \le be_\ell \text{ and } (e_i < bb_r \text{ or } \nexists k), \\ (j, be_r) & \text{otherwise.} \end{cases}$$

(See also Fig. 2 in Appendix.) For all variables $X_i$ we pre-compute $pre(X_i, 2(q - 1))$ and $suf(X_i, 2(q - 1))$. This can be done in a total of $O(qn)$ time. Then, each $\overrightarrow{loc}_q(X_i, j)$ can be computed in $O(q)$ time using the KMP algorithm, Lemma 3, and the above recursion, giving a total of $O(q^2 n)$ time for all $1 \le i \le n$ and $1 \le j \le 2(q - 1)$. □

Lemma 5. For all $1 \le i \le n$ and $1 \le j \le 2(q - 1)$, $\overleftarrow{loc}_q(X_i, j)$ can be computed in a total of $O(q^2 n)$ time.

Proof. The proof is essentially the same as the proof for $\overrightarrow{loc}_q(X_i, j)$ in Lemma 4. □

Recall that we have assumed in Theorem 2 that $loc_q(X_i, j)$ are already computed. The following lemma describes how $loc_q(X_i, j)$ can actually be computed in a total of $O(q^2 n)$ time.

Lemma 6. For all variables $X_i = X_\ell X_r$ and $j$ s.t. $\max\{1, |X_\ell| - 2(q - 1) + 1\} \le j \le \min\{|X_\ell| + q - 1, |X_i| - q + 1\}$, $(b, e) = loc_q(X_i, j)$ can be computed in a total of $O(q^2 n)$ time.

Proof. Let $s_j = X_i[j : j + q - 1]$. Firstly, we compute $(b_i, e_i) = loc_q(X_i[\max\{1, |X_\ell| - 2(q - 1) + 1\} : \min\{|X_i|, |X_\ell| + 2(q - 1)\}], j')$, and then $loc_q(X_i, j)$ can be computed based on $(b_i, e_i)$, as follows: Let $(eb_\ell, ee_\ell) = \overleftarrow{loc}_q(X_\ell, |X_\ell| - ee_\ell + 1)$ and $(bb_r, be_r) = |X_\ell| \oplus \overrightarrow{loc}_q(X_r, bb_r - |X_\ell|)$, where $ee_\ell = \max Occ(X_i[\max\{1, |X_\ell| - 2(q - 1) + 1\} : |X_\ell|], s_j)$ and $bb_r = \min Occ(X_i[|X_\ell| + 1 : \min\{|X_i|, |X_\ell| + 2(q - 1)\}], s_j)$.

1. If $b_i \le |X_\ell|$ and $e_i > |X_\ell|$, then we have $b \le b_i \le |X_\ell| < e_i \le e$. $(b, e) = loc_q(X_i, j)$ can be computed by checking whether $(eb_\ell, ee_\ell)$, $(b_i, e_i)$, and $(bb_r, be_r)$ are overlapping or not. (See also Fig. 3 in Appendix.)

2. If $e_i \le |X_\ell|$, then trivially $b = eb_\ell$ and $e = e_i$.

3. If $b_i > |X_\ell|$, then trivially $b = b_i$ and $e = be_r$.

Each $ee_\ell = h$ and $bb_r = |X_\ell| + k$ can be computed using the KMP algorithm on string $suf(X_\ell, 2(q - 1))\, pre(X_r, 2(q - 1))$ in $O(q)$ time. By Lemmas 4 and 5, $(eb_\ell, ee_\ell)$ and $(bb_r, be_r)$ can be pre-computed in a total of $O(q^2 n)$ time for all $1 \le i \le n$. Hence the lemma holds. □



3.3 Largest Left-Priority and Smallest Right-Priority Occurrences 

In order to compute $nOcc(X_i[b : e], s)$ for all $X_i$ and all $j$ required for Theorem 2, where $(b, e) = loc_q(X_i, j)$ and $s = X_i[j : j + q - 1]$, we will use the largest and second largest occurrences of $LnOcc$ and the smallest and second smallest occurrences of $RnOcc$.

For any set $S$ of integers and integer $1 \le k \le |S|$, let $\max_k S$ and $\min_k S$ denote the $k$-th largest and the $k$-th smallest element of $S$.

For $1 \le i \le n$ and $1 \le j \le 2(q - 1)$, consider computing $\max_k LnOcc(X_i[j : be_i], p_j)$ for $k = 1, 2$, where $(j, be_i) = \overrightarrow{loc}_q(X_i, j)$ and $p_j = X_i[j : j + q - 1]$. Intuitively, the difficulty in computing $\max_k LnOcc(X_i[j : be_i], p_j)$ comes from the fact that the string $val(X_i)[j : be_i]$ can be as long as $O(2^n)$, but we only have the prefix $pre(X_i, 3(q - 1))$ and suffix $suf(X_i, 3(q - 1))$ of $val(X_i)$, of length $O(q)$. Hence we cannot compute the value of $be_i$ by simply running the KMP algorithm on those partial strings. For the same reason, the size of $LnOcc(X_i[j : be_i], p_j)$ can be as large as $O(2^n / q)$. Hence we cannot store $LnOcc(X_i[j : be_i], p_j)$ as is. Still, as will be seen in the following lemma, we can compute those values efficiently, in only $O(q^2 n)$ time.

Lemma 7. For all variables $X_i = X_\ell X_r$ and $1 \le j \le 2(q - 1)$, let $(j, be_i) = \overrightarrow{loc}_q(X_i, j)$ and $p_j = X_i[j : j + q - 1]$. We can compute the values $\max_1 LnOcc(X_i[j : be_i], p_j)$ and $\max_2 LnOcc(X_i[j : be_i], p_j)$ for all $1 \le i \le n$ and $1 \le j \le 2(q - 1)$, in a total of $O(q^2 n)$ time.

Proof. See Appendix.

The next lemma can be shown similarly to Lemma 7.

Lemma 8. For all variables $X_i = X_\ell X_r$ and $1 \le j \le 2(q - 1)$, let $(eb, ee) = \overleftarrow{loc}_q(X_i, j)$ and $s_j = X_i[|X_i| - j - q + 2 : |X_i| - j + 1]$. We can compute the values $\min_1 RnOcc(X_i[eb : ee], s_j)$ and $\min_2 RnOcc(X_i[eb : ee], s_j)$ for all $1 \le i \le n$ and $1 \le j \le 2(q - 1)$, in a total of $O(q^2 n)$ time.

Lemma 9. For all variables $X_i = X_\ell X_r$ and $1 \le j \le q$, $\max LnOcc(X_i[eb_i : ee_i], s_j)$ can be computed in a total of $O(q^2 n)$ time, where $(eb_i, ee_i) = \overleftarrow{loc}_q(X_i, j)$ and $s_j = X_i[|X_i| - j - q + 2 : |X_i| - j + 1]$.

Proof. The lemma can be shown by using Lemma 7. See Appendix for details.

Lemma 10. For all variables $X_i = X_\ell X_r$ and $1 \le j \le q$, $\min RnOcc(X_i[bb_i : be_i], p_j)$ can be computed in a total of $O(q^2 n)$ time, where $(bb_i, be_i) = \overrightarrow{loc}_q(X_i, j)$ and $p_j = X_i[j : j + q - 1]$.

Proof. The lemma can be shown in a similar way to Lemma 9, using Lemma 8 instead of Lemma 7. □

3.4 Counting Non-Overlapping Occurrences in Longest Overlapping 
Covers 

Firstly, we show how to count non-overlapping occurrences of the $q$-gram $p_j$ in $X_i[j : be_i]$, for all $i$ and $j$, where $p_j = X_i[j : j + q - 1]$ and $(j, be_i) = \overrightarrow{loc}_q(X_i, j)$.

Lemma 11. For all variables $X_i = X_\ell X_r$ and $1 \le j \le 2(q - 1)$, let $(j, be_i) = \overrightarrow{loc}_q(X_i, j)$ and $p_j = X_i[j : j + q - 1]$. We can compute $nOcc(X_i[j : be_i], p_j)$ for all $1 \le i \le n$ and $1 \le j \le 2(q - 1)$, in a total of $O(q^2 n)$ time.

Proof. By Lemma 1, we have $nOcc(X_i[j : be_i], p_j) = |LnOcc(X_i[j : be_i], p_j)|$. We compute the occurrence $b_i$ in $(j - 1) \oplus LnOcc(X_i[j : be_i], p_j)$ that crosses $X_\ell$ and $X_r$, if such exists. Note that at most one such occurrence exists. Also, we compute the smallest occurrence $bb_r$ in $(j - 1) \oplus LnOcc(X_i[j : be_i], p_j)$ that is completely within $X_r$. Then the desired value $nOcc(X_i[j : be_i], p_j)$ can be computed depending on whether $b_i$ and $bb_r$ exist or not.

Formally: Consider the set $S = ((j - 1) \oplus LnOcc(X_i[j : be_i], p_j)) \cap [|X_\ell| - q + 2 : |X_\ell|]$ of occurrences of $p_j$, which is either empty or a singleton. If $S$ is a singleton, then let $b_i$ be its single element. Let $bb_r = \min\{k \mid k \in ((j - 1) \oplus LnOcc(X_i[j : be_i], p_j)) \cap [|X_\ell| + 1 : |X_\ell| + q - 1], \text{ and if } \exists b_i \text{ then } k \ge b_i + q\}$.

Then we have

$$nOcc(X_i[j : be_i], p_j) = \begin{cases} nOcc(X_r[j - |X_\ell| : be_i - |X_\ell|], p_j) & \text{if } j > |X_\ell|, \\ nOcc(X_\ell[j : be_\ell], p_j) & \text{if } \nexists b_i \text{ and } \nexists bb_r, \\ nOcc(X_\ell[j : be_\ell], p_j) + 1 & \text{if } \exists b_i \text{ and } \nexists bb_r, \\ nOcc(X_\ell[j : be_\ell], p_j) + nOcc(X_r[bb_r : be_r], p_j) & \text{if } \nexists b_i \text{ and } \exists bb_r, \\ nOcc(X_\ell[j : be_\ell], p_j) + nOcc(X_r[bb_r : be_r], p_j) + 1 & \text{if } \exists b_i \text{ and } \exists bb_r, \end{cases}$$

where $(bb_r, be_r) = \overrightarrow{loc}_q(X_r, bb_r)$.

For all variables $X_i$ we pre-compute $pre(X_i, 3(q - 1))$ and $suf(X_i, 3(q - 1))$. This can be done in a total of $O(qn)$ time. If $b_i$ or $bb_r$ exists, $|X_\ell| - 3(q - 1) < j - 1 + \max LnOcc(X_\ell[j : be_\ell], p_j) < |X_\ell| - q + 2$. Then, each $b_i$ and $bb_r$ can be computed from $LnOcc(X_i[(j - 1 + \max LnOcc(X_\ell[j : be_\ell], p_j)) : |X_\ell| + 3(q - 1)], p_j)$ by running the KMP algorithm on string $suf(X_\ell, 3(q - 1))\, pre(X_r, 3(q - 1))$. Based on the above recursion, we can compute $nOcc(X_i[j : be_i], p_j)$ in a total of $O(q^2 n)$ time for all $1 \le i \le n$ and $1 \le j \le 2(q - 1)$. □

The next lemma can be shown similarly to Lemma 11.

Lemma 12. For all variables $X_i = X_\ell X_r$ and $1 \le j \le 2(q - 1)$, let $(eb_i, ee_i) = \overleftarrow{loc}_q(X_i, j)$ and $s_j = X_i[|X_i| - j - q + 2 : |X_i| - j + 1]$. We can compute $nOcc(X_i[eb_i : ee_i], s_j)$ for all $1 \le i \le n$ and $1 \le j \le 2(q - 1)$, in a total of $O(q^2 n)$ time.

We have also assumed in Theorem 2 that $nOcc(X_i[b : e], s_j)$ are already computed. This can be computed efficiently, as follows:

Lemma 13. For all variables $X_i = X_\ell X_r$ and $j$ s.t. $\max\{1, |X_\ell| - 2(q - 1) + 1\} \le j \le \min\{|X_i| - q + 1, |X_\ell| + q - 1\}$, $nOcc(X_i[b : e], s_j)$ can be computed in a total of $O(q^2 n)$ time, where $(b, e) = loc_q(X_i, j)$ and $s_j = X_i[j : j + q - 1]$.
Proof. We consider the case where $\max\{1, |X_\ell| - q + 2\} \le j \le |X_\ell|$, as the other cases can be shown similarly. Our basic strategy for computing $nOcc(X_i[b : e], s_j)$ is as follows. Firstly, we compute the largest element of $LnOcc(X_i[b : e], s_j)$ that occurs completely within $X_\ell$. Secondly, we compute the smallest element of $RnOcc(X_i[b : e], s_j)$ that occurs completely within $X_r$. Thirdly, we compute the occurrences of $s_j$ that cross the boundary of $X_\ell$ and $X_r$ and do not overlap the above occurrences of $s_j$ completely within $X_\ell$ and $X_r$.

Formally: Let $ee_\ell = b + q - 2 + \max Occ(X_i[b : |X_\ell|], s_j)$, $bb_r = |X_\ell| + \min Occ(X_i[|X_\ell| + 1 : e], s_j)$, $u_1 = b + q - 2 + \max LnOcc(X_i[b : ee_\ell], s_j)$, and $u_2 = bb_r - 1 + \min RnOcc(X_i[bb_r : e], s_j)$. We consider the case where all these values exist, as other cases can be shown similarly. It follows from Lemmas 1 and 2 that

$$\begin{aligned}
nOcc(X_i[b : e], s_j) &= |LnOcc(X_i[b : u_1], s_j)| + nOcc(X_i[u_1 + 1 : u_2 - 1], s_j) + |RnOcc(X_i[u_2 : e], s_j)| \\
&= nOcc(X_i[b : ee_\ell], s_j) + nOcc(X_i[u_1 + 1 : u_2 - 1], s_j) + nOcc(X_i[bb_r : e], s_j).
\end{aligned}$$

(See also Fig. 6 in Appendix.)

By Lemma 6, $(b, e) = loc_q(X_i, j)$ can be pre-computed in a total of $O(q^2 n)$ time. Since $b < ee_\ell$ and $bb_r < e$, $ee_\ell$ and $bb_r$ can be computed in $O(q)$ time using the KMP algorithm. By Lemmas 11 and 12, $nOcc(X_i[b : ee_\ell], s_j)$ and $nOcc(X_i[bb_r : e], s_j)$ can be pre-computed in a total of $O(q^2 n)$ time. (Notice that $(b, ee_\ell) = \overleftarrow{loc}_q(X_\ell, ee_\ell)$ and $(bb_r, e) = |X_\ell| \oplus \overrightarrow{loc}_q(X_r, bb_r - |X_\ell|)$.) By Lemmas 9 and 10, $u_1$ and $u_2$ can be pre-computed in a total of $O(q^2 n)$ time. Hence $nOcc(X_i[u_1 + 1 : u_2 - 1], s_j)$ can be computed in $O(q)$ time using the KMP algorithm for each $i$ and $j$. The lemma thus holds. □

3.5 Main Result 

The following theorem concludes this whole section. 

Theorem 3. Problem 2 can be solved in $O(q^2 n)$ time and $O(qn)$ space.

Proof. The time complexity and correctness follow from Theorem 2, Lemma 6, and Lemma 13.

We compute and store strings $suf(X_i, 3(q - 1))$ and $pre(X_i, 3(q - 1))$ of length $O(q)$ for each variable $X_i$; hence this requires a total of $O(qn)$ space for all $1 \le i \le n$. We use a constant number of dynamic programming tables, each of which is of size $O(qn)$. Hence the total space complexity is $O(qn)$. □

4 Conclusion and Discussion 

We considered the problem of computing the non-overlapping frequencies for all $q$-grams that occur in a given text represented as an SLP. Our algorithm greatly improves previous work, which solved the problem only for $q = 2$, requiring $O(n^4 \log n)$ time and $O(n^3)$ space. We give the first algorithm which works for any $q \ge 2$, running in $O(q^2 n)$ time and $O(qn)$ space, where $n$ is the size of the SLP.
Appendix
A Proofs

Proof of Theorem 1.

Proof. We will make use of the suffix array and lcp array.

The suffix array [15] $SA$ of any string $T$ is an array of length $|T|$ such that $SA[i] = j$, where $T[j : |T|]$ is the $i$-th lexicographically smallest suffix of $T$. The lcp array of any string $T$ is an array of length $|T|$ such that $LCP[i]$ is the length of the longest common prefix of $T[SA[i - 1] : |T|]$ and $T[SA[i] : |T|]$ for $2 \le i \le |T|$, and $LCP[1] = 0$.

It is well known that the suffix array for any string of length $|T|$ can be constructed in $O(|T|)$ time (e.g. [9]) assuming an integer alphabet. Given the text and suffix array, the lcp array can also be calculated in $O(|T|)$ time [11].

We can calculate the overlapping $q$-gram frequencies of string $T$ using suffix array $SA$ and lcp array $LCP$. $SA[i]$ represents an occurrence of the $q$-gram $T[SA[i] : SA[i] + q - 1]$. Since the suffixes are lexicographically sorted in the suffix array, intervals on the suffix array where the values of the lcp array are at least $q$ represent occurrences of the same $q$-gram. The sum of $w[SA[i]]$ in this interval is the desired value for the $q$-gram. Constructing $SA$ and $LCP$ can be done in $O(|T|)$ time, and summing up $w[SA[i]]$ for each interval where $LCP[i] \ge q$ can easily be done in $O(|T|)$ time by a simple scan. □
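The interval scan in the proof above can be sketched as follows. To keep the example short, the suffix array is built by naively sorting suffixes and the lcp values are computed by direct comparison, so this sketch does not achieve the linear-time bound of Theorem 1; only the grouping logic mirrors the proof. Indices are 0-based here and the function names are assumptions of this illustration.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def weighted_qgram_freq(t: str, q: int, w: list[int]) -> dict[str, int]:
    """Sum of w[i] over all occurrences i of each q-gram of t (0-based)."""
    n = len(t)
    sa = sorted(range(n), key=lambda i: t[i:])     # naive suffix array
    lcp = [0] * n
    for r in range(1, n):
        lcp[r] = common_prefix_len(t[sa[r - 1]:], t[sa[r]:])
    totals = {}
    current = None                                 # q-gram of the open interval
    for r in range(n):
        start = sa[r]
        if start + q > n:                          # suffix too short for a q-gram
            current = None
            continue
        if current is None or lcp[r] < q:          # a new interval begins here
            current = t[start:start + q]
            totals[current] = 0
        totals[current] += w[start]
    return totals


print(weighted_qgram_freq("abab", 2, [1, 10, 100, 0]))
```

With all weights set to 1, the same scan yields the ordinary overlapping $q$-gram frequencies.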

Proof of Lemma 1.

Proof. We prove $nOcc(T[1 : i], P) = |LnOcc(T[1 : i], P)|$ by induction on $i$. For $i \le 1$, the statement clearly holds. Now, assume that the statement holds for $i < k$, where $k \ge 2$. For $i = k$, notice that $0 \le nOcc(T[1 : k], P) - |LnOcc(T[1 : k], P)| \le 1$, since there can be at most one new occurrence of $P$ ending at position $k$, which may or may not be counted for $nOcc(T[1 : k], P)$. If we assume on the contrary that the statement does not hold for $i = k$, then $nOcc(T[1 : k], P) - nOcc(T[1 : k - 1], P) = nOcc(T[1 : k], P) - |LnOcc(T[1 : k], P)| = 1$. Since the change was caused by the new occurrence, we have $nOcc(T[1 : k], P) = nOcc(T[1 : k - |P|], P) + 1$. By the inductive hypothesis, we have $nOcc(T[1 : k - |P|], P) = |LnOcc(T[1 : k - |P|], P)|$. Also, $|LnOcc(T[1 : k], P)| = |LnOcc(T[1 : k - |P|], P)| + 1$, since the new occurrence does not overlap with any occurrence in $LnOcc(T[1 : k - |P|], P)$. This leads to $nOcc(T[1 : k], P) = |LnOcc(T[1 : k], P)|$, a contradiction. $nOcc(T, P) = |RnOcc(T, P)|$ can be shown symmetrically. □

Proof of Lemma 7.

Proof. We compute the smallest occurrence $b_i$ in $(j - 1) \oplus LnOcc(X_i[j : be_i], p_j)$ that crosses $X_\ell$ and $X_r$. Also, we compute the smallest occurrence $bb_r$ in $(j - 1) \oplus LnOcc(X_i[j : be_i], p_j)$ that is completely within $X_r$. Then the desired value $\max_1 LnOcc(X_i[j : be_i], p_j)$ can be computed depending on whether $b_i$ and $bb_r$ exist or not.

Formally, consider the set $S = ((j - 1) \oplus LnOcc(X_i[j : be_i], p_j)) \cap [|X_\ell| - q + 2 : |X_\ell|]$ of occurrences of $p_j$, which is either empty or a singleton. If $S$ is a singleton, then let $b_i$ be its single element. Let $bb_r = \min\{k \mid k \in ((j - 1) \oplus LnOcc(X_i[j : be_i], p_j)) \cap [|X_\ell| + 1 : |X_\ell| + 2(q - 1)], \text{ and if } \exists b_i \text{ then } k \ge b_i + q\}$.

Then we have

$$\max\nolimits_1 LnOcc(X_i[j : be_i], p_j) = \begin{cases} \max_1 LnOcc(X_\ell[j : be_\ell], p_j) & \text{if } \nexists b_i \text{ and } \nexists bb_r, \\ b_i - j + 1 & \text{if } \exists b_i \text{ and } \nexists bb_r, \\ bb_r - j + \max_1 LnOcc(X_r[bb_r - |X_\ell| : be_r], p_j) & \text{if } \exists bb_r. \end{cases}$$

(See also Fig. 4 in Appendix B.)

For all variables $X_i$ we pre-compute $pre(X_i, 3(q - 1))$ and $suf(X_i, 3(q - 1))$. This can be done in a total of $O(qn)$ time. If $b_i$ or $bb_r$ exists, $|X_\ell| - 3(q - 1) < j - 1 + \max LnOcc(X_\ell[j : be_\ell], p_j) \le |X_\ell| - q + 1$. Then, each $b_i$ and $bb_r$ can be computed from $LnOcc(X_i[(j - 1 + \max LnOcc(X_\ell[j : be_\ell], p_j)) : |X_\ell| + 3(q - 1)], p_j)$ by running the KMP algorithm on string $suf(X_\ell, 3(q - 1))\, pre(X_r, 3(q - 1))$.

Based on the above recursion, we can compute $\max_1 LnOcc(X_i[j : be_i], p_j)$ in a total of $O(q^2 n)$ time for all $1 \le i \le n$ and $1 \le j \le 2(q - 1)$.

It is not difficult to see that similar claims, with slightly different conditions, can be made for $\max_2 LnOcc(X_i[j : be_i], p_j)$, where the value corresponds to one of 4 values: $\max_2 LnOcc(X_\ell[j : be_\ell], p_j)$, $\max_1 LnOcc(X_\ell[j : be_\ell], p_j)$, $b_i$, or $\max_2 LnOcc(X_r[bb_r - |X_\ell| : be_r], p_j)$, with appropriate offsets. □

Proof of Lemma 9.

Proof. Our basic strategy for computing $\max LnOcc(X_i[eb_i : ee_i], s_j)$ is as follows. Firstly, we compute the largest element of $LnOcc(X_i[eb_i : ee_i], s_j)$ that occurs completely within $X_\ell$. Secondly, we compute the smallest element of $LnOcc(X_i[eb_i : ee_i], s_j)$ that crosses the boundary of $X_\ell$ and $X_r$. Let $d$ be this occurrence, if such exists. Then the desired output $\max LnOcc(X_i[eb_i : ee_i], s_j)$ is given as either the largest or the second largest element of $(d + q - 1) \oplus LnOcc(X_r[d + q - |X_\ell| : |X_r|], s_j)$.

More formally: We consider the case where $eb_i + q - 1 \le |X_\ell|$. Let $ee_\ell = q - 1 + \max(Occ(X_\ell, s_j) \cap [|X_\ell| - 2(q - 1) + 1 : |X_\ell| - q + 1])$ and $m = eb_\ell - 1 + \max LnOcc(X_\ell[eb_\ell : ee_\ell], s_j)$, where $(eb_\ell, ee_\ell) = \overleftarrow{loc}_q(X_\ell, |X_\ell| - ee_\ell + 1)$. Let $d = m + q - 1 + \min LnOcc(X_i[m + q : ee_i], s_j)$. Let

$$bb_r = \begin{cases} d & \text{if } ee_i - q + 1 \le |X_\ell| \text{ or } d > |X_\ell|, \\ d + q - 1 + \min LnOcc(X_i[d + q : |X_i|], s_j) & \text{otherwise.} \end{cases}$$

Let $h' = |X_\ell| + \max_2 LnOcc(X_r[bb_r' : be_r'], s_j)$ and $h = |X_\ell| + \max_1 LnOcc(X_r[bb_r' : be_r'], s_j)$, where $(bb_r', be_r') = \overrightarrow{loc}_q(X_r, bb_r - |X_\ell|)$. (See also Fig. 5 in Appendix B.) Then

$$\max LnOcc(X_i[eb_i : ee_i], s_j) = \begin{cases} h & \text{if } h \le ee_i - q + 1, \\ h' & \text{otherwise.} \end{cases}$$

The case where $eb_i + q - 1 > |X_\ell|$ can be solved similarly.

Each $ee_\ell$, $d$ and $bb_r$ can be computed in $O(q)$ time using the KMP algorithm, hence requiring a total of $O(q^2 n)$ time. By Lemmas 4 and 5, $\overleftarrow{loc}_q(X_\ell, ee_\ell)$ and $\overrightarrow{loc}_q(X_r, bb_r)$ can be computed in $O(q^2 n)$ time for all $X_i = X_\ell X_r$ and $1 \le j \le q$. By Lemma 7, $h'$ and $h$ can be computed in a total of $O(q^2 n)$ time for all $X_i = X_\ell X_r$ and $1 \le j \le q$. Therefore, by dynamic programming we can compute $\max LnOcc(X_i[eb_i : ee_i], s_j)$ in a total of $O(q^2 n)$ time. □
B Figures

Fig. 2. Illustration for Lemma 4. In this figure, $\overrightarrow{loc}_q(X_i, j) = (j, e_i)$.

Fig. 3. Illustration for Lemma 6. Rectangles show important occurrences of $X_i[j : j + q - 1]$. In this case $b = eb_\ell$ and $e = be_r$.

Fig. 4. Illustration for Lemma 7, calculating $\max LnOcc(X_i[j : be_i], p_j)$. Shadowed occurrences are not in $LnOcc(X_i[j : be_i], p_j)$, while white ones are in $LnOcc(X_i[j : be_i], p_j)$.

Fig. 5. Illustration for Lemma 9. Rectangles show important occurrences of $s_j$. In this case $\max LnOcc(X_i[eb_i : ee_i], s_j) = h'$, as $h > ee_i - q + 1$.

Fig. 6. Illustration for Lemma 13. Rectangles show important occurrences of $X_i[j : j + q - 1]$. In this case $nOcc(X_i[b : ee_\ell], s_j) = 3$, $nOcc(X_i[u_1 + 1 : u_2 - 1], s_j) = 1$, and $nOcc(X_i[bb_r : e], s_j) = 3$.